Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Adding Wilson Score Confidence Interval Strategy #567

Merged
merged 13 commits into from
May 24, 2024

Conversation

zeotuan
Copy link
Contributor

@zeotuan zeotuan commented May 6, 2024

Fixes #563

Description of changes:

  • Update RetainCompletenessRules and FractionalCategoricalRangeRule to accept and configure ConfidenceIntervalStrategy parameter
  • Add Wilson Score Confidence Interval Strategy and Wald Interval Strategy (current default)
  • Make Wilson Score Confidence Interval Strategy the new default method

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

@zeotuan zeotuan changed the title Tpm/interval strategy Adding Wilson Score Confidence Interval Strategy May 6, 2024
@zeotuan
Copy link
Contributor Author

zeotuan commented May 13, 2024

@rdsharma26 Hi can you help review this PR.

object ConfidenceIntervalStrategy {
val defaultConfidence = 0.95

case class ConfidenceInterval(lowerBound: Double, upperBound: Double)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Currently also calculate upperBound for these ConfidenceInterval. At the moment we don't actually make use of the upperBound though

@rdsharma26
Copy link
Contributor

Thank you for the PR! @zeotuan
The changes look good to me. Can you fix the failing build ?

One point I would like to discuss is making the Wilson Score Confidence Interval Strategy the new default. Could this potentially break backwards compatibility in terms of behavior? If so, should we stick to the Wald Interval Strategy as the default and update the documentation so that users of Deequ can choose which one they want?

@zeotuan
Copy link
Contributor Author

zeotuan commented May 20, 2024

Thank you for the PR! @zeotuan The changes look good to me. Can you fix the failing build ?

One point I would like to discuss is making the Wilson Score Confidence Interval Strategy the new default. Could this potentially break backwards compatibility in terms of behavior? If so, should we stick to the Wald Interval Strategy as the default and update the documentation so that users of Deequ can choose which one they want?

I think making Wilson Score Confidence Interval default right now might introduce some surprising changes to existing data quality pipelines. I will add example usage documentation for this.
We can potentially introduce plan to change the default in a major version update and add "deprecation" message for now so user migrate themselves.

@rdsharma26
Copy link
Contributor

rdsharma26 commented May 21, 2024

We can potentially introduce plan to change the default in a major version update and add "deprecation" message for now so user migrate themselves.

This seems like a safe approach. Could we configure the default to be the Wald strategy in the following line?

 private val defaultIntervalStrategy: ConfidenceIntervalStrategy = WilsonScoreIntervalStrategy()

The build also failed due to:

error file=/home/runner/work/deequ/deequ/src/test/scala/com/amazon/deequ/suggestions/rules/interval/IntervalStrategyTest.scala message=expected start of definition, but was Token(VAL,val,1285,val)

@@ -23,16 +23,17 @@ import com.amazon.deequ.metrics.DistributionValue
import com.amazon.deequ.profiles.ColumnProfile
import com.amazon.deequ.suggestions.ConstraintSuggestion
import com.amazon.deequ.suggestions.ConstraintSuggestionWithValue
import com.amazon.deequ.suggestions.rules.FractionalCategoricalRangeRule.defaultIntervalStrategy
import com.amazon.deequ.suggestions.rules.interval.{ConfidenceIntervalStrategy, WilsonScoreIntervalStrategy}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: Could we avoid grouped imports and use one import per line?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just to clarify, do we prefer separate import or single import but with each on a single line

import com.amazon.deequ.suggestions.rules.interval.ConfidenceIntervalStrategy
import com.amazon.deequ.suggestions.rules.interval.WilsonScoreIntervalStrategy

or

import com.amazon.deequ.suggestions.rules.interval{
  ConfidenceIntervalStrategy,
  WilsonScoreIntervalStrategy
}

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The former. It helps with automatic resolution of merge conflicts.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated

@rdsharma26
Copy link
Contributor

Left a comment and a unit test needs fixing.

Copy link
Contributor

@rdsharma26 rdsharma26 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM! Thanks @zeotuan for your continued contribution to Deequ!

@rdsharma26 rdsharma26 merged commit 101142e into awslabs:master May 24, 2024
1 check passed
eycho-am pushed a commit to eycho-am/deequ that referenced this pull request Oct 9, 2024
* Configurable RetainCompletenessRule

* Add doc string

* Add default completeness const

* Add ConfidenceIntervalStrategy

* Add Separate Wilson and Wald Interval Test

* Add License information, Fix formatting

* Add License information

* formatting fix

* Update documentation

* Make WaldInterval the default strategy for now

* Formatting import to per line

* Separate group import to per line import
eycho-am pushed a commit to eycho-am/deequ that referenced this pull request Oct 9, 2024
* Configurable RetainCompletenessRule

* Add doc string

* Add default completeness const

* Add ConfidenceIntervalStrategy

* Add Separate Wilson and Wald Interval Test

* Add License information, Fix formatting

* Add License information

* formatting fix

* Update documentation

* Make WaldInterval the default strategy for now

* Formatting import to per line

* Separate group import to per line import
eycho-am pushed a commit to eycho-am/deequ that referenced this pull request Oct 9, 2024
* Configurable RetainCompletenessRule

* Add doc string

* Add default completeness const

* Add ConfidenceIntervalStrategy

* Add Separate Wilson and Wald Interval Test

* Add License information, Fix formatting

* Add License information

* formatting fix

* Update documentation

* Make WaldInterval the default strategy for now

* Formatting import to per line

* Separate group import to per line import
eycho-am pushed a commit to eycho-am/deequ that referenced this pull request Oct 9, 2024
* Configurable RetainCompletenessRule

* Add doc string

* Add default completeness const

* Add ConfidenceIntervalStrategy

* Add Separate Wilson and Wald Interval Test

* Add License information, Fix formatting

* Add License information

* formatting fix

* Update documentation

* Make WaldInterval the default strategy for now

* Formatting import to per line

* Separate group import to per line import
mentekid pushed a commit that referenced this pull request Oct 9, 2024
* Configurable RetainCompletenessRule (#564)

* Configurable RetainCompletenessRule

* Add doc string

* Add default completeness const

* Optional specification of instance name in CustomSQL analyzer metric. (#569)

Co-authored-by: Tyler Mcdaniel <[email protected]>

* Adding Wilson Score Confidence Interval Strategy (#567)

* Configurable RetainCompletenessRule

* Add doc string

* Add default completeness const

* Add ConfidenceIntervalStrategy

* Add Separate Wilson and Wald Interval Test

* Add License information, Fix formatting

* Add License information

* formatting fix

* Update documentation

* Make WaldInterval the default strategy for now

* Formatting import to per line

* Separate group import to per line import

* CustomAggregator (#572)

* Add support for EntityTypes dqdl rule

* Add support for Conditional Aggregation Analyzer

---------

Co-authored-by: Joshua Zexter <[email protected]>

* fix typo (#574)

* Fix performance of building row-level results (#577)

* Generate row-level results with withColumns

Iteratively using withColumn (singular) causes performance
issues when iterating over a large sequence of columns.

* Add back UNIQUENESS_ID

* Replace 'withColumns' with 'select' (#582)

'withColumns' was introduced in Spark 3.3, so it won't
work for Deequ's <3.3 builds.

* Replace rdd with dataframe functions in Histogram analyzer (#586)

Co-authored-by: Shriya Vanvari <[email protected]>

* Updated version in pom.xml to 2.0.8-spark-3.4

---------

Co-authored-by: zeotuan <[email protected]>
Co-authored-by: tylermcdaniel0 <[email protected]>
Co-authored-by: Tyler Mcdaniel <[email protected]>
Co-authored-by: Joshua Zexter <[email protected]>
Co-authored-by: Joshua Zexter <[email protected]>
Co-authored-by: bojackli <[email protected]>
Co-authored-by: Josh <[email protected]>
Co-authored-by: Shriya Vanvari <[email protected]>
Co-authored-by: Shriya Vanvari <[email protected]>
mentekid pushed a commit that referenced this pull request Oct 9, 2024
* Configurable RetainCompletenessRule (#564)

* Configurable RetainCompletenessRule

* Add doc string

* Add default completeness const

* Optional specification of instance name in CustomSQL analyzer metric. (#569)

Co-authored-by: Tyler Mcdaniel <[email protected]>

* Adding Wilson Score Confidence Interval Strategy (#567)

* Configurable RetainCompletenessRule

* Add doc string

* Add default completeness const

* Add ConfidenceIntervalStrategy

* Add Separate Wilson and Wald Interval Test

* Add License information, Fix formatting

* Add License information

* formatting fix

* Update documentation

* Make WaldInterval the default strategy for now

* Formatting import to per line

* Separate group import to per line import

* CustomAggregator (#572)

* Add support for EntityTypes dqdl rule

* Add support for Conditional Aggregation Analyzer

---------

Co-authored-by: Joshua Zexter <[email protected]>

* fix typo (#574)

* Fix performance of building row-level results (#577)

* Generate row-level results with withColumns

Iteratively using withColumn (singular) causes performance
issues when iterating over a large sequence of columns.

* Add back UNIQUENESS_ID

* Replace 'withColumns' with 'select' (#582)

'withColumns' was introduced in Spark 3.3, so it won't
work for Deequ's <3.3 builds.

* Replace rdd with dataframe functions in Histogram analyzer (#586)

Co-authored-by: Shriya Vanvari <[email protected]>

* Match Breeze version with spark 3.3 (#562)

* Updated version in pom.xml to 2.0.8-spark-3.3

---------

Co-authored-by: zeotuan <[email protected]>
Co-authored-by: tylermcdaniel0 <[email protected]>
Co-authored-by: Tyler Mcdaniel <[email protected]>
Co-authored-by: Joshua Zexter <[email protected]>
Co-authored-by: Joshua Zexter <[email protected]>
Co-authored-by: bojackli <[email protected]>
Co-authored-by: Josh <[email protected]>
Co-authored-by: Shriya Vanvari <[email protected]>
Co-authored-by: Shriya Vanvari <[email protected]>
mentekid pushed a commit that referenced this pull request Oct 9, 2024
* Configurable RetainCompletenessRule (#564)

* Configurable RetainCompletenessRule

* Add doc string

* Add default completeness const

* Optional specification of instance name in CustomSQL analyzer metric. (#569)

Co-authored-by: Tyler Mcdaniel <[email protected]>

* Adding Wilson Score Confidence Interval Strategy (#567)

* Configurable RetainCompletenessRule

* Add doc string

* Add default completeness const

* Add ConfidenceIntervalStrategy

* Add Separate Wilson and Wald Interval Test

* Add License information, Fix formatting

* Add License information

* formatting fix

* Update documentation

* Make WaldInterval the default strategy for now

* Formatting import to per line

* Separate group import to per line import

* CustomAggregator (#572)

* Add support for EntityTypes dqdl rule

* Add support for Conditional Aggregation Analyzer

---------

Co-authored-by: Joshua Zexter <[email protected]>

* fix typo (#574)

* Fix performance of building row-level results (#577)

* Generate row-level results with withColumns

Iteratively using withColumn (singular) causes performance
issues when iterating over a large sequence of columns.

* Add back UNIQUENESS_ID

* Replace 'withColumns' with 'select' (#582)

'withColumns' was introduced in Spark 3.3, so it won't
work for Deequ's <3.3 builds.

* Replace rdd with dataframe functions in Histogram analyzer (#586)

Co-authored-by: Shriya Vanvari <[email protected]>

* Updated version in pom.xml to 2.0.8-spark-3.2

---------

Co-authored-by: zeotuan <[email protected]>
Co-authored-by: tylermcdaniel0 <[email protected]>
Co-authored-by: Tyler Mcdaniel <[email protected]>
Co-authored-by: Joshua Zexter <[email protected]>
Co-authored-by: Joshua Zexter <[email protected]>
Co-authored-by: bojackli <[email protected]>
Co-authored-by: Josh <[email protected]>
Co-authored-by: Shriya Vanvari <[email protected]>
Co-authored-by: Shriya Vanvari <[email protected]>
mentekid pushed a commit that referenced this pull request Oct 9, 2024
* Configurable RetainCompletenessRule (#564)

* Configurable RetainCompletenessRule

* Add doc string

* Add default completeness const

* Optional specification of instance name in CustomSQL analyzer metric. (#569)

Co-authored-by: Tyler Mcdaniel <[email protected]>

* Adding Wilson Score Confidence Interval Strategy (#567)

* Configurable RetainCompletenessRule

* Add doc string

* Add default completeness const

* Add ConfidenceIntervalStrategy

* Add Separate Wilson and Wald Interval Test

* Add License information, Fix formatting

* Add License information

* formatting fix

* Update documentation

* Make WaldInterval the default strategy for now

* Formatting import to per line

* Separate group import to per line import

* CustomAggregator (#572)

* Add support for EntityTypes dqdl rule

* Add support for Conditional Aggregation Analyzer

---------

Co-authored-by: Joshua Zexter <[email protected]>

* fix typo (#574)

* Fix performance of building row-level results (#577)

* Generate row-level results with withColumns

Iteratively using withColumn (singular) causes performance
issues when iterating over a large sequence of columns.

* Add back UNIQUENESS_ID

* Replace 'withColumns' with 'select' (#582)

'withColumns' was introduced in Spark 3.3, so it won't
work for Deequ's <3.3 builds.

* Replace rdd with dataframe functions in Histogram analyzer (#586)

Co-authored-by: Shriya Vanvari <[email protected]>

* pdated version in pom.xml to 2.0.8-spark-3.1

---------

Co-authored-by: zeotuan <[email protected]>
Co-authored-by: tylermcdaniel0 <[email protected]>
Co-authored-by: Tyler Mcdaniel <[email protected]>
Co-authored-by: Joshua Zexter <[email protected]>
Co-authored-by: Joshua Zexter <[email protected]>
Co-authored-by: bojackli <[email protected]>
Co-authored-by: Josh <[email protected]>
Co-authored-by: Shriya Vanvari <[email protected]>
Co-authored-by: Shriya Vanvari <[email protected]>
arsenalgunnershubert777 pushed a commit to arsenalgunnershubert777/deequ that referenced this pull request Nov 8, 2024
* Configurable RetainCompletenessRule

* Add doc string

* Add default completeness const

* Add ConfidenceIntervalStrategy

* Add Separate Wilson and Wald Interval Test

* Add License information, Fix formatting

* Add License information

* formatting fix

* Update documentation

* Make WaldInterval the default strategy for now

* Formatting import to per line

* Separate group import to per line import
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[FEATURE] Support Wilson Score Interval for RetainCompletenessRule
2 participants